Lab 3: Distribution Comparisons and Joint Probabilities
Introduction
In this tutorial you will learn how to be derive basic insights from data and visualize joint probabilities. In particular we will be visualizing distributions, both discrete and continuous, both univariate and bivariate. By the end of the tutorial you will be able to:
- draw and compare distributions two or more distributions
- draw histograms, densities,
- draw box, violin and jitter plots,
- draw 2D histograms and densities, pair plots and joint plots to visualize joint probabilities,
- interpret these visuals.
Getting Started
In this lab we will again visualize House Prices data used once in Kaggle Challenge, which involves prices and some properties of houses in Washington. To be able to follow you need:
tidyverse,GGally,ggridgesandgridExtrapackages (install them if not yet installed)- to download
house_subset.csvdata uploaded on LEARN
If tidyverse installation were unsuccessful, you can instead use ggplot for this lab:
Recall the Data
## [1] 7633 22
## 'data.frame': 7633 obs. of 22 variables:
## $ X : int 2 6 7 16 17 20 22 23 27 31 ...
## $ id : num 6.41e+09 7.24e+09 1.32e+09 9.30e+09 1.88e+09 ...
## $ date : Factor w/ 344 levels "20140502T000000",..: 203 10 52 240 85 326 108 58 51 179 ...
## $ price : num 538000 1225000 257500 650000 395000 ...
## $ bedrooms : int 3 4 3 4 3 3 3 5 3 3 ...
## $ bathrooms : num 2.25 4.5 2.25 3 2 1 2.75 2.5 1.75 2.5 ...
## $ sqft_living : int 2570 5420 1715 2950 1890 1250 3050 2270 2450 2320 ...
## $ sqft_lot : int 7242 101930 6819 5000 14040 9774 44867 6300 2691 3980 ...
## $ floors : num 2 1 2 2 2 1 1 2 2 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 3 0 0 4 0 0 0 ...
## $ condition : int 3 3 3 3 3 4 3 3 3 3 ...
## $ grade : int 7 11 7 9 7 7 9 8 8 8 ...
## $ sqft_above : int 2170 3890 1715 1980 1890 1250 2330 2270 1750 2320 ...
## $ sqft_basement: int 400 1530 0 970 0 0 720 0 700 0 ...
## $ yr_built : int 1951 2001 1995 1979 1994 1969 1968 1995 1915 2003 ...
## $ yr_renovated : int 1991 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98125 98053 98003 98126 98019 98003 98040 98092 98119 98027 ...
## $ lat : num 47.7 47.7 47.3 47.6 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1690 4760 2238 2140 1890 1280 4110 2240 1760 2580 ...
## $ sqft_lot15 : int 7639 101930 6819 4000 14018 8850 20336 7005 3573 3980 ...
Above summerizes the data. There are 7633 houses in the dataset and 22 columns of information. Apart from the price of the house, there is information about number of bathrooms, square feet of living spaces, number of floors, whether or not it is water front, location and so on.
Below we show the unique zipcodes which gives information about how many neighbours are included in the dataset:
## [1] 98125 98053 98003 98126 98019 98040 98092 98119 98027 98001 98070 98105
## [13] 98042 98059 98122 98075 98116 98032 98065 98109 98155 98011 98106 98072
## [25] 98055
We know that some values are not continuous but treated as if so. So we must tell R to read as categorical instead:
Univariate Distributions
There are two types of univariate distributions, continuous and discrete. Let’s start with the univariate continuous one.
There are a number of visuals that you can play with:
- Histograms and densities
- Boxplots, violinplots and jitter plots
- A combination of these
Histograms, Densities and Boxplots
Histograms can tell us about the distribution of the data at hand. As in Review of MSCI 251 file, you can see that normal distribution has a signature bell shape. So when we plot the data, if we can see a similar shape we can say, it seems normally distributed. Let’s see how normal distribution looks like in the window of Histogram and Boxplot:
library('gridExtra') ## Needed for side by side plots
normal <- data.frame(x = rnorm(10000))
p1 <- ggplot(normal, aes(x = x)) +
geom_histogram(fill=adjustcolor('firebrick',.7), color='white') +
theme_classic() +
theme(axis.title = element_blank(), axis.text.y = element_blank()) +
xlim(c(-4,4)) +
ggtitle('Normal Distribution')
p2 <- ggplot(normal, aes(x = x)) +
geom_boxplot(outlier.colour = 'firebrick') +
theme_classic() +
theme(axis.title = element_blank(), axis.text.y = element_blank()) +
xlim(c(-4,4))
p3 <- ggplot(house, aes(x = price)) +
geom_histogram(fill=adjustcolor('firebrick',.7), color='white') +
theme_classic() + xlim(c(0, 6000000)) +
theme(axis.title = element_blank(), axis.text.y = element_blank()) +
ggtitle('Distribution of Price')
p4 <- ggplot(house, aes(x = price)) +
geom_boxplot(outlier.colour = 'firebrick') +
theme_classic() + xlim(c(0, 6000000)) +
theme(axis.title = element_blank(), axis.text.y = element_blank())
grid.arrange(p1,p2,p3,p4, ncol=1)Certanly not normally distributed. It is rightskewed. Also it has too many outliers based on the boxplot which overinterprets outliers.
The price data has its peak around 300-650. This also indicate that median price is around 300-500. Also Boxplot shows the 1st quartile is around 100K and 3rd quartile is around 600K.
Transforming Data
In many cases you will have a non-normal distributed data but you will need normal. This is very common problem and we will visit later. One way to make your data look more normal is to transform it with log or sqrt functions.
p1 <- ggplot(house, aes(x=price)) +
geom_histogram(fill = adjustcolor('darkolivegreen2',0.7), color='white', bins=30) +
ggtitle('Boxplot of Price') + theme_light()
p2 <- ggplot(house, aes(x=log(price))) +
geom_histogram(fill = adjustcolor('firebrick',0.7), color='white', bins=30) +
ggtitle('Boxplot of LOG Price') + theme_light()
p3 <- ggplot(house, aes(x=sqrt(price))) +
geom_histogram(fill = adjustcolor('steelblue',0.7), color='white', bins=30) +
ggtitle('Boxplot of SQRT Price') + theme_light()
grid.arrange(p1,p2, p3, ncol=3)Well, log transformation seems more like the one that we want to see. Let’s continue with log prices instead.
Comparing Univariate Distributions
The price data is a continuous variable and from above we can say that it is not normally distributed. But maybe it is not common in whole data, maybe there is a factor that will change the picture.
Let’s explore price more using histograms, but this time visualize the counts but the density (proportions) on the y axis:
ggplot(house, aes(x=price, y=stat(density))) +
geom_histogram(fill = adjustcolor('firebrick',0.7), color='white', bins=30) +
ggtitle('Histogram of Price') + theme_light()Overlaying Histograms
Houses are not expensive everywhere. For example if we want to compare particular regions, e.g. 98040, 98105 and 98116 we can subset these regions overlay histograms:
ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) +
geom_histogram(alpha=.5, colour='white', position='identity', bins = 20) +
theme_bw()There are roughly 3.5 bins placed in the range of 0 - 1,000,000. The median of 98116 and 98105 are around 300 - 600K and there are many houses in this range in 98116. Also the red bins are above the others for higher prices, which translates into 98040 is overally expensive.
We may wonder whether or not waterfront matters:
ggplot(house, aes(x=price, y=..density.., fill=waterfront)) +
geom_histogram(alpha=.5, colour='white', bins=20, position = 'identity')Similarly waterfront houses are way more expensive than others.
Overlaying Densities
Another way to visualize is to use density plots. Here density is empirical estimate to probability density function:
ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) +
geom_density(alpha=.6) +
theme_bw()Similarly we can ask if waterfront houses are more expensive:
Facet Wrap
Facet wraps are powerfull in comparing few distributions. Let’s compare 98040, 98105 and 98116 using facet wraps:
ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) +
geom_histogram(alpha=.5, colour='white') +
facet_wrap( ~ zipcode, dir = 'v') +
theme_bw()98040 spans a wider range of price whereas 98116 is relatively cheaper. Both 98116 and 98105 have peaks in range of 400K - 600K, so most of the houses are in this range. Median of 98040 is around 800K - 1 million, whereas the other two region is around 400-600K.
Back to Back Plot
Overlapping gives good information when comparing two distributions but a better way can be back2back plot. But I must admit that it is not easy to plot it. We will do it by
- ploting the two group separately by subsetting the data
- flipping direction of one of the data by multiplying with -1 as:
-..count..or-..density..
Let’s compare houses in zipcode == 98040 and zipcode == 98105
ggplot(house) +
geom_histogram(data= subset(house, zipcode == 98040), aes(x = log(price), y = ..count..), fill="firebrick", colour='white') +
geom_histogram(data= subset(house, zipcode == 98105), aes(x = log(price), y = -..count..), fill= "steelblue", colour='white')We can see both of them are expensive regions but the houses in 98105 is overally cheaper as the distribution is more right-skewed. The red data piles up (and therefore its median is) around 13.60 compared to the blue one that piles up around 13.20.
Human’s eye catches the asymmetry better than anything else. Guess why? So a better practice is to flip coordinates:
ggplot(house, aes(x=price)) +
geom_histogram(data= subset(house, zipcode == 98040), aes(y = log(price), x = ..count..), fill="firebrick", colour='white') +
geom_histogram(data= subset(house, zipcode == 98105), aes(y = log(price), x = -..count..), fill= "steelblue", colour='white') +
annotate('text', x = -20, y = 15.25, label='Mercer Island') +
annotate('text', x = 20, y = 15.25, label='University District') +
ggtitle('Comparison of LOG pries') +
theme_minimal()Comparing many distributions
Density
ggplot(house, aes(x=log(price), fill=bedrooms)) +
geom_density(aes(y=..density..),alpha=.3) +
theme_minimal()## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
Histogram
ggplot(house, aes(x=log(price), fill=bedrooms)) +
geom_histogram(aes(y=..density..),alpha=.6) +
theme_minimal()Reminds me a song a song.
It is interesting to see how many houses have 9 bedrooms! The density plot is a better choice than histogram to compare more than 3 series.
Faceting
As you can see above it is becoming harder to compare precisely the distributions. One way to deal with this problem is using facet_wrap. Let’s see if the distribution is different among different locations:
Let’s check neighbourhood-wise:
ggplot(house, aes(x=log(price), color = zipcode)) +
geom_histogram() +
ggtitle('Boxplot of LOG Price') +
facet_wrap(~zipcode, ncol=5) + # try adding scales = 'free'
theme_minimal() +
theme(axis.text = element_blank(),
legend.position = 'none')Log values can transform the data into more normal shape. There are many neighbourhoods who have house log prices normally distributed.
Boxplots
Boxplot is a solution to above problem. It summarizes important aspects and visualizes condensed so we can compare many series. Take for example boxplot of LOG prices:
ggplot(house, aes(x=log(price))) +
geom_boxplot(outlier.colour = adjustcolor('firebrick',.6), outlier.size = 3) +
ggtitle('Boxplot of LOG Price') + theme_light()Can you see 5 values and outliers above?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.35 12.67 12.99 13.03 13.35 15.48
The left end red point is the minimum (11.35) and right end red point is the maximum (15.48). The median is slightly left to 13, the 1st quartile (12.67) and 3rd quartile (13.35) are left and right edges of the box.
It is very powerful for summarizing important information in one single frame:
ggplot(house, aes(y = zipcode, x=price, color = zipcode)) +
geom_boxplot() +
ggtitle('Boxplot of Price') +
theme_light() +
theme(legend.position = 'none')Which neighbourhood seems more expensive? In the neighbour, zipcode = 98040, median price of a house is greater than any other neighbourhood, also the most expensive house in our dataset is located there.
Do you want to see where is 98040?
A better practice is to sort the boxplots based on their median values. This can be done by giving order to the zipcode data. Zipcodes are numbers, so by default R reads them as numeric, however we want them to be treated as level data. So we will overwrite zipcode as below:
zipcodes <- unique(house$zipcode)
medians <- sapply(zipcodes, function(z) median(house$price[house$zipcode == z]))
ord <- order(medians) # the order
house$zipcode <- factor(house$zipcode , levels=zipcodes[ord])
ggplot(house, aes(y = zipcode, x=price, color = zipcode)) +
geom_boxplot() +
ggtitle('Boxplot of Price') +
theme_light() +
theme(legend.position = 'none')Jitter Plots, Violin Plots and more
Boxplots are used a lot in practice but as we discussed above they lose some information. Another problem with boxplot is it shows too much outliers than actually there is. Below there are three options to overcome the problem:
Jitter
ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
geom_jitter(alpha=.6) +
theme_light() +
theme(legend.position = 'none')Jitter & Boxplot
ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
geom_jitter(alpha=.4, position = position_jitter(.2)) +
geom_boxplot(color='black',fill=adjustcolor('grey90',.3),outlier.alpha = 0) +
theme_light() +
theme(legend.position = 'none')Density Ridges
library('ggridges')
ggplot(house, aes(y=zipcode, x=log(price), fill=zipcode)) +
geom_density_ridges() +
theme_minimal() +
theme(legend.position = 'none')Violin
ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
geom_violin(fill=adjustcolor('grey75',0.7)) +
theme_light() +
theme(legend.position = 'none')Cleveland Dot
No worries, we will learn
medians <- group_by(house, zipcode) %>% summarise(price=median(price))
ggplot(medians, aes(y=zipcode)) +
geom_point(aes(x=price), colour='firebrick') +
geom_segment(aes(x=200000, yend = zipcode, xend=price), colour = 'grey', alpha=.5) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(legend.position = 'none') +
ggtitle('Median prices in different neighbourhoods')The jitter plot can add the information about data intensity in certain locations, for example number of houses in locations with zipcode 98032, 98070 and 98109 are not as much as 98059. This information lacks in boxplot. But jitter doesn’t have the important information such as median and quartiles. The overlap adds this.
Also violin plots can give it for us. It is adding by changing the boxes to symmetric density plots.
Joint distributions
When we have more than one variable we will try to understand the relationship between them. We already have seen one, scatter plot and we will explore it more later.
Before beginning, for sake of simplicity, we will reduce the number of regions in the dataset:
A scatter plot is showing the joint distribution. Below you can see that the data is dense in certain regions. This is analogical to histograms where the higher bins refer to more data.
library('ggExtra')
g <- ggplot(house.small, aes(y=price, x=sqft_living)) +
geom_point(alpha=.05, size = 3) +
theme_light() +
theme(legend.position = 'none')
ggMarginal(g, type="histogram", size=10)The more dense regions corresponds to high peaks of the histograms. There are 2D versions of histogram and density which visualize the depth not with the height of the bar, but colour of it.
2D Histograms and Densities
2D Histogram
ggplot(house.small, aes(y=price, x=sqft_living)) +
geom_bin2d(bins = 15, colour='white') +
theme_light() From the plot we can say that in Seattle Washington for most houses, 1250 < sqft_living < 1875 and 330,000 < price < 660,000.
Hex
ggplot(house.small, aes(y=price, x=sqft_living)) +
geom_hex(bins = 15, colour='white') +
theme_light() Hex Bins are more precise. We can say that sqft_living of most of the houses are ranging between 1500 to 2100 and price is around 500,000 - 600,000.
2D Density
ggplot(house.small, aes(y=price, x=sqft_living)) +
stat_density_2d(aes(fill=..level..), geom = 'polygon', colour = 'white') +
theme_light() Here we can interpret similarly. Majority of the houses are around 1500 sqft and $600,000
Comparing more than two variables (Pairs plot)
There are different ways for pairs plot. We will use a ggplot friendly option. For this you will need GGally package. Before plotting let’s remember the column names:
## [1] "X" "id" "date" "price"
## [5] "bedrooms" "bathrooms" "sqft_living" "sqft_lot"
## [9] "floors" "waterfront" "view" "condition"
## [13] "grade" "sqft_above" "sqft_basement" "yr_built"
## [17] "yr_renovated" "zipcode" "lat" "long"
## [21] "sqft_living15" "sqft_lot15"
Recall that sqft* variables are continuous as with the price. Let’s start by plotting three continuous variables, sqft_living,sqft_basement and sqft_above:
# install.packages('GGally')
library('GGally')
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = c('price','sqft_living','sqft_basement','sqft_above'),
upper='blank',
diag=NULL) +
theme_minimal()The pairs plot shows that sqft_living, sqft_basement and sqft_above are co-related. Basement size is rarely larger than sqft_above as the basement usually less than maximum width. sqft_living is the sum of all living spaces, and therefore is greater than or equal to sqft_above.
Notice that each of the combinations are scatter plots and show us about joint distribution of two variables. You can also see that diagonals are joint distribution of a variable with itself, and therefore univariate distributions. So by removing diag = NULL we can add more information:
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = c('price','sqft_living','sqft_basement','sqft_above'),
upper='blank') +
theme_minimal()ggpairs has many default options that we can tweak. You can tell the algorithm to plot points with smooters when it encounters two continuous variables, barplot when it encounters two discrete variables and boxplots when one variable is continuous and the other is discrete. Here are the options
- continuous:
points,smooth,density,cor,blank - discrete:
ratio,facetbar,blank - combo:
box,dot,facethist,facetdensity,denstrip,blank
Let’s add two discrete variables to columns to plot list put histogram
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = c('sqft_living','sqft_above', 'floors','zipcode'),
upper=list(continuous = 'density', combo = 'box', discrete='ratio')) +
theme_minimal()For more details check this link.
Please be aware that too much complexity in plots increases the brainpower to parse the information and disgust the reader, eventually your report will be treated as non-credible, ugly, sloppy etc. It is better to not to plot continuous with discrete.
Some good examples
Empty Upper
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
cont_vars <- colnames(house.small)[grepl('sqft_', colnames(house.small))]
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = cont_vars,
upper='blank') +
theme_void()Symmetric
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
cont_vars <- colnames(house.small)[grepl('sqft_', colnames(house.small))]
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = cont_vars,
upper=list(continuous = 'points')) +
theme_void()With Density
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
cont_vars <- colnames(house.small)[grepl('sqft_', colnames(house.small))]
ggpairs(house.small, aes(colour = zipcode, alpha = 0.4),
columns = cont_vars,
upper=list(continuous = 'density')) +
theme_void()Custom Mapping
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
cont_vars <- colnames(house.small)[grepl('sqft_', colnames(house.small))]
ggpairs(house.small,
columns = cont_vars, aes(alpha=.3),
lower=list(continuous = 'points', mapping = aes(color = zipcode)),
upper=list(continuous = 'density')) +
theme_void()Custom Function
house.small$zipcode <- as.factor(as.character(house.small$zipcode)) # just to reduce number of levels
cont_vars <- colnames(house.small)[grepl('sqft_', colnames(house.small))]
bin2d <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_bin2d(...) +
scale_fill_gradient(low = "steelblue", high = "firebrick")
}
bin1d <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_histogram(..., colour='white', fill='black')
}
ggpairs(house.small,
columns = cont_vars,
diag = list(continuous = bin1d),
lower = list(continuous = 'points',mapping = aes(colour = zipcode, alpha=.3)),
upper = list(continuous = bin2d)) +
theme_void()Deliverables
Choose one of the dataset uploaded on LEARN and follow the below instructions to develop insights.
1. Data structure
- Print the dimensions of the data
- Print the structure (str) of the data
- Identify and write names of all the continuous variables in your data
- Identify and write names of all the non-categorical (continuous and numeric discrete that are not categorical) variables in your data
- Convert the discrete numeric but categorical variables to categorical variables
as.factor(...). See Lab 4 “Continuous and Discrete, Numeric and String, Black and Blue” for details. - Tell about your data in few lines combining the information you printed in previous steps.
2. Distributions
Choose one continuous variable (similar to price) in your data that is important to analyze, and choose a categorical variable (similar to zipcode).
- Draw histogram. Assign
fillto any colour you like, setcolour='white'andalpha = 0.5
- In additional to the original plot, try LOG, SQRT transformation and state which one looks similar to normal distribution
- Comment on the distribution properties, whether it is skewed, location of the median, whether there is one subset differs from other or so on.
- Change the theme to
theme_minimal() - Comment on the distribution properties, whether it is skewed, location of the median, whether there is one subset differs from other or so on.
- Draw another histogram. Map categorical variable to
fill, setcolour='white'andalpha = 0.5, and addfacet_wrapfaceting to the variable
- Change the theme to
theme_minimal() - Comment on the distribution properties, whether it is skewed, location of the median, whether there is one subset differs from other or so on.
3. Comparing Distributions
Choose one continuous data and one categorical variable (can be the same above)
- Subset the data using two values in the categorical variable (similar to zipcode == 98040 and 98105)
- Draw a back to back plot comparing distributions of these two subgroups.
- Comment on what you see, e.g. compare the medians, compare the histograms and tell what it means.
- Choose one of boxplot, jitter, violin or density_ringes.
- Draw the boxplot (or whatever you have chosen) that will break down the categorical variable.
- This time don’t subset to the two values you used in the previous step, use all.
- Map the variable to
colourorfill. - Set
alpha = 0.5and change theme totheme_minimal() - (Optional) Make sure to order the boxes (or whatever) with respect to median.
- Comment on what you see, e.g. compare the medians, min, max etc. and tell what it means.
4. Scatter plot
- Choose two continuous variables to visualize the relation, also choose one categorical variable.
- Draw a scatterplot of the two, map the categorical variable to
colour, setalpha=0.3or less - Add a smoother to your data
- If you feel it is more convenient for visualization, add
facet_wrap(...). - Is there a relation between the variables? Does the relation changes w.r.t. the category?
5. Pairs plot
- Draw pairs plot between all continuous variables
- Comment on what you see. Is there a relation between any combinations?